DESCRIPTION
Scoring of Text Units



TECHNICAL FIELD 

The invention relates to a method and a system for scoring
text units (e.g. sentences), for example according to
their contribution in defining the meaning of a source
text (textual relevance), their ability to form a
cohesive subtext (textual connectivity) or the extent and
effectiveness to which they address the different topics
which characterise the subject matter of the text (topic
aptness).

BACKGROUND ART 

When abridging a text it is desirable to select a portion
of the text which is most representative in that it
contains as many of the key concepts defining the text
as possible (textual relevance). As an example,
EP-A-741364 (Xerox Corp.) discloses a method of selecting
key phrases from a machine readable document by (a)
generating from the document a multiplicity of candidate
phrases (units of more than one word), followed by (b)
selecting as key phrases a subset of the candidate phrases.
This selection, known as summarisation, may also take into
consideration the degree of textual connectivity among
sentences so as to minimise the danger of producing
summaries which contain poorly linked sentences.

Computing lexical cohesion for all pair-wise text unit
combinations in a text provides an effective way of
assessing textual relevance and connectivity in parallel,
see for example Hoey M. (1991) Patterns of Lexis in Text.
OUP, Oxford, UK; and Collier A. (1994) A System for
Automatic Concordance Line Selection. NEMLAP 1994,
Manchester, UK. A simple way of computing lexical
cohesion for a pair of text units is to count non-stop
words which occur in both text units. Non-stop words can
be intuitively thought of as words which have high
informational content. They usually exclude words with
a very high frequency of occurrence, e.g. closed class
words such as determiners, prepositions and conjunctions,
see for example Fox, C. (1992) Lexical Analysis and
Stoplists, in Frakes W and Baeza-Yates R (eds)
Information Retrieval: Data Structures & Algorithms.
Prentice Hall, Upper Saddle River, NJ, USA, pp 102-130.

A sample list of stop words is given below:- 

a about above across after again against all almost alone
along already also although always among and another any
anybody anyone anything anywhere are area areas around
as ask asked asking asks at away b back backed backing
backs be became because become becomes been before began
behind being beings best better between big both but by
c came can cannot case cases certain certainly clear
clearly come could d did differ different differently
do does done down downed downing ... v very w
want wanted wanting wants was way ways we well
wells went were what when where whether which while
who whole whose why will with within without work
worked working works would x y year years yet you
young younger youngest your yours z

Text units which contain a greater number of shared
non-stop words are more likely to provide a better
abridgement of the original text for two reasons:

the more often a word with high informational content
occurs in a text, the more topical and germane
to the text the word is likely to be, and

the more times two text units share a word,
the more connected they are likely to be.

As an illustrative example, consider the ranking of the
following sample text, where digits surrounded by hash
characters (#) are text unit indexes.

#1# Report: Apple looking for a Partner

#2# NEW YORK (Reuter) - Apple is actively looking for
a friendly merger partner, according to several
executives close to the company, the New York
Times said on Thursday.

#3# One executive who does business with Apple said
Apple employees told him the company was again
in talks with Sun Microsystems, the paper said.

#4# On Wednesday, Saudi Arabia's Prince Alwaleed Bin
Talal Bin Abdulaziz Al Saud said he owned more
than five percent of the computer maker's stock,
recently buying shares on the open market for a
total of $115 million.

#5# Oracle Corp Chairman Larry Ellison confirmed on
March 27 he had formed an independent investor
group to gauge interest in taking over Apple.

#6# The company was not immediately available to
comment.

To compute lexical cohesion according to the method
suggested by Hoey (see above reference), all unique
pairwise combinations of text units are scored according
to how many words they share, as shown in the table
below.



Text unit pairs   Words shared                       Score

#1#  #2#          Apple, look, partner               3
#3#  #5#          Apple, Apple                       2
#1#  #3#          Apple, Apple                       2
#3#  #6#          company                            1
#1#  #4#                                             0
#4#  #5#                                             0
#1#  #5#          Apple                              1
#4#  #6#                                             0
#1#  #6#                                             0
#5#  #6#                                             0
#2#  #3#          Apple, Apple, executive, company   4
#2#  #4#                                             0
#2#  #5#          Apple                              1
#2#  #6#          company                            1
#3#  #4#                                             0



The number of shared words (including multiple occurrences
of the same word) in each text unit pair provides the
individual score for that pair. For example, the individual
scores for all pairs involving text unit #2# are:





        #1#   #2#   #3#   #4#   #5#   #6#
#2#     3           4     0     1     1

Table 1



The final score for a given text unit is obtained by summing
the individual scores for that text unit. According to
Hoey (see above reference), the number of links (e.g.
shared words) across two text units must be above a certain
threshold for the two text units to achieve a lexical
cohesion rank. For example, if only individual scores
greater than 2 are taken into account, the final score
for text unit #2# is (3 + 4 =) 7. Proceeding in the same way,
the final scores for text units #1# and #3# are 3 and 4
respectively.
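
The thresholded summation just described can be illustrated with a short Python sketch. It is not part of the cited prior art: the names PAIR_SCORES and final_scores are hypothetical, and the individual scores are simply copied from the pair table above (pairs scoring 0 are omitted).

```python
# Individual scores copied from the pair table above (zero-score pairs omitted).
PAIR_SCORES = {
    (1, 2): 3, (1, 3): 2, (1, 5): 1,
    (2, 3): 4, (2, 5): 1, (2, 6): 1,
    (3, 5): 2, (3, 6): 1,
}


def final_scores(pair_scores, threshold):
    """Sum, for each text unit, the individual scores strictly above the threshold."""
    totals = {}
    for (a, b), score in pair_scores.items():
        if score > threshold:
            totals[a] = totals.get(a, 0) + score
            totals[b] = totals.get(b, 0) + score
    return totals


scores = final_scores(PAIR_SCORES, threshold=2)
# Rank text units by final score, highest first.
ranking = sorted(scores, key=lambda unit: scores[unit], reverse=True)
print(scores)
print(ranking)
```

With a threshold of 2 only the pairs (#1#, #2#) and (#2#, #3#) contribute, which gives the final scores 7, 4 and 3 stated above for text units #2#, #3# and #1#.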

Such a scoring provides the following ranking:

first: text unit #2# (final score: 7);
second: text unit #3# (final score: 4); and
third: text unit #1# (final score: 3).

A text abridgement can be obtained by selecting text units
in ranking order according to the text percentage
specified by the user. For example, a 35% abridgement of
the text (i.e. an abridgement of up to 35% of the total
number of text units in the sample text) would result in
the selection of text units #2# and #3#.
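
As a hedged illustration of this selection step, the sketch below picks the top-ranked text units up to a given percentage of the text. The function name abridge is hypothetical, and returning the chosen units in their original text order is an assumption made for readability rather than something the description specifies.

```python
def abridge(ranked_units, total_units, percentage):
    """Select top-ranked units, using at most `percentage` of all text units."""
    # Integer arithmetic rounds down, matching "up to" the requested percentage.
    limit = total_units * percentage // 100
    chosen = ranked_units[:limit]
    # Assumption: present the selected units in their original text order.
    return sorted(chosen)


selected = abridge([2, 3, 1], total_units=6, percentage=35)
print(selected)
```

For the six-unit sample text, 35% allows two text units, so the two highest-ranked units, #2# and #3#, are selected.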

Further details about lexical cohesion and the ways in
which it can be used to aid summarisation can be found
in the Hoey and Collier references mentioned above.

Other prior art on related technology includes: Doi (1991)
Method and apparatus for producing an abstract of a
document - US patent 5077668; Ukita et al. (1993) Digital
Computing Apparatus for Preparing Document Text - US
patent 5257186; Withgott et al. Method and apparatus for
Summarising documents according to theme - US patent
5364703; and Pedersen, J. & J. Tukey (1997) Method and
Apparatus for Automatic Document Summarisation - US
patent 5638543.

DISCLOSURE OF INVENTION 

It is an object of the invention to provide a method and 
system for ranking text units which overcomes at least 
some of the disadvantages of the prior art. 

According to the invention there is provided a method of
operating on a text including a plurality of text units,
each including one or more strings, the method including
the steps of:

forming a structure for each of at least some of said
strings, in which structure the string is associated with
each text unit in which the string occurs;

for each text unit summing the number of occurrences of
each other text unit in the same structure or structures
so as to form an individual score for each pair of text
units; and

processing said individual scores for each text unit in
order to form a final score for each text unit.

The use of such structures considerably reduces the time 
taken to operate on the text because it is no longer 
necessary to count the number of strings shared between 
all possible pairs of text units in turn. 

More specifically, the degree of connectivity of a text
unit with all other text units in a text can be simply
assessed by quantifying the elements (e.g. words) which
each text unit shares with pairs built by associating each
element in the text with the list of pointers to the text
units in which the element occurs. This provides a
significant advantage in terms of processing speed when
compared to a method such as the one described by Hoey
(1991) and Collier (1994) where the same assessment is
carried out by computing all pairwise combinations of text
units. In particular, the word-per-second processing
rate is significantly less affected by text size.

The method may include the further step of ranking the
text units on the basis of said individual scores.

In one embodiment of the invention, said text units are 
sentences, said strings are words forming said sentences , 
and the method includes the additional steps of removing 
stop-words, stemming each remaining word and indexing the 
sentences prior to carrying out said summing step, and 
said structures are stem-index records each including a 
stemmed word and one or more indexes corresponding to 
sentences in which said stemmed word occurs. 

In an alternative embodiment, said text is associated with
a word text including words, each word being associated
with one or more subject codes representing subjects with
which said word is associated, and said strings are subject
codes associated with said words.

In this case the method may comprise the further step of
keeping a record of the word spelling associated with each
occurrence of a subject code in a text unit, and during
said summing step disregarding occurrences of the same
subject code in a pair of text units if the same word
spelling is associated with said same subject code in said
pair of text units.

It will be appreciated that each word may have a number
of possible subject codes, some of which are contextually
inappropriate for the context in which the word is being
used. The last-mentioned feature allows the method to
perform disambiguation of the subject codes, by
disregarding occurrences of subject codes which are
contextually inappropriate, as will be described in
greater detail below.

Said step of disregarding occurrences of subject codes 
may not be carried out for subject codes which relate to 
only a single word spelling in the word text. 

Said processing step may include calculating a level for
each text unit, in addition to said final score, and said
level may indicate the value of the highest of said
individual scores in relation to a threshold value.

This allows text units to be ranked first according to
level, and second according to said final score, if
desired.

WO 99/39282 PCT/JP99/00259

The invention also provides a storage medium containing
a program for controlling a programmable data processor
to perform the method described above.

The invention also provides a system for ranking text
units in a text, the system including a data processor
programmed to perform the steps of the method described
above.

BRIEF DESCRIPTION OF DRAWINGS 

Preferred embodiments of the invention will now be
described, by way of example only, with reference to the
accompanying drawings, in which:

Figure 1 shows a flow chart outlining some of the
steps involved in a preferred embodiment of the
invention;

Figure 2 shows a flow chart which is a continuation of
the flow chart of Figure 1; and

Figure 3 shows an apparatus suitable for carrying out the
methods described below.

BEST MODE FOR CARRYING OUT THE INVENTION

In an embodiment of the invention described below the
ranking of text units is carried out with reference to
the presence of shared words across text units. The
assessment of textual relevance and connectivity can both
be carried out by counting shared links (e.g. identical
words) across all text unit pairs. The method makes it
possible to perform this assessment by quantifying the
elements (e.g. words) which each text unit shares with
stem-index pairs, each such pair comprising an element
in the text and a list of pointers to the text units in
which the element occurs. This technique makes it
possible to rank text units at a processing rate which
is significantly less affected by text size than a system
where the same assessment is carried out by computing all
pairwise combinations of text units.

The ranking is done by assessing

how germane each text unit is to the source text (textual
relevance);

how well connected each text unit is to other text units
in the source text (textual connectivity); and

how well each text unit represents the various topics
dealt with in the source text (topic aptness).

In a further embodiment described below, the same
technique is used for assessment of topic aptness. Shared
links across text units are verified in terms of 
overlapping semantic codes associated with words (e.g. 
the connotations business and government for the 
word executive) with reference to a dictionary or 
thesaurus database providing a specification of such 
codes for word entries. 

The method can be divided into two phases, namely a 
preparatory phase, followed by a ranking phase. In the 
preparatory phase the text undergoes a number of 
normalisations which have the purpose of facilitating the 
process of computing lexical cohesion. This phase 
includes the following operations: 



text segmentation;
removal of formatting commands;
recognition of proper names;
recognition of multi-word expressions;
removal of stop words; and
word tokenization.

Further ways of normalizing the input text are also
mentioned later in the specification.

The object of segmentation is to partition the input text
into text units which stand on their own (e.g. sentences,
titles, and section headings) and to index such text units,
for example as shown in the sample text given above.

Next, formatting commands such as the HTML (hyper-text
mark-up language) mark-ups in the text are dealt with.

The sample text including HTML formatting commands looks
like the following:

<h2>Report: Apple Looking for a Partner</h2>
<!--TextStart-->
<P>

NEW YORK (Reuter) - Apple is actively looking for a
friendly merger partner, according to several executives
close to the company, the New York Times said on Thursday.
<P>

One executive who does business with Apple said Apple
employees told him the company was again in talks with
Sun Microsystems, the paper said.
<P>

On Wednesday, Saudi Arabia's Prince Alwaleed Bin
Talal Bin Abdulaziz Al Saud said he owned more than five
percent of the computer maker's stock, recently buying
shares on the open market for a total of $115 million.
<P>

Oracle Corp Chairman Larry Ellison confirmed on March
27 he had formed an independent investor group to gauge
interest in taking over Apple.
<P>

The company was not immediately available to comment.
<!--TextEnd-->

In the present embodiment, the formatting commands are
simply removed, but alternative treatments are mentioned
below.

A facility for recognizing proper names and multi-word
expressions is also included. Such a facility makes it
possible to process expressions such as Apple, New York,
New York Times, gauge interest as single units which should
not be further tokenized. The recognition of such units
ensures that expressions which superficially resemble
each other, but have different meanings - e.g. Apple (the
company) and apple (the fruit), or York in New York (the
city) and New York Times (the newspaper) - do not actually
generate lexical cohesion links. For further information
relating to recognising proper nouns and multi-word
expressions reference can be made respectively to David
McDonald (1996) Internal and External Evidence in the
Identification and Semantic Categorization of Proper
Names, in B. Boguraev and J. Pustejovsky (eds) Corpus
Processing for Lexical Acquisition, MIT Press; and Justeson,
J. S. and Katz, S. M., 1995. Technical terminology: some
linguistic properties and an algorithm for identification
in text. In Natural Language Engineering, 1:9-27.

Next, all words in the input text which match stop words,
such as those mentioned above, are removed. This step
ensures that words which are low in informational content
are not taken into account when assessing lexical
cohesion. After stop-word removal, the calculation of
shared words across text units is further optimized by
tokenizing non-stop words. Word tokenization is achieved
by reducing words into stems or citation forms, e.g.



Input strings       Stems         Citation forms
actively looking    activ look    active look



Citation forms generally correspond to the manner in which
words are listed in conventional dictionaries, and the
process of reducing words to citation form is referred
to as lemmatisation. Reduction of words to stem form
generally involves a greater truncation of the word in
which all inflections are removed. The purpose of
reducing words to stems or citation forms is to achieve
a more effective notion of word sharing, e.g. one which
abstracts away from the effects of inflectional and/or
derivational morphology. Stemming provides a very
powerful word tokenization technique as it undoes both
derivational and inflectional morphology. For example,
stemming makes it possible to capture the similarity
between the words nature, natural, naturally, naturalize,
naturalizing as they all reduce to the stem natur. Word
reduction to citation form would only capture the
relationship between naturalize and naturalizing. In the
present embodiment, stemming will be used. For a
description of some stemming techniques reference can be
made to Frakes W. (1992) Stemming Algorithms, in Frakes
W and Baeza-Yates R (eds) Information Retrieval: Data
Structures & Algorithms. Prentice Hall, Upper Saddle
River, NJ, USA, pp. 131-160. For further information
relating to lemmatisation reference can be made to Hadumod
Bussmann (1996) Routledge Dictionary of Language and
Linguistics, Routledge, London, p. 272. Following the
stages of stop-word removal and stemming, the sample text
is as shown below.



#1# report Apple look partner
#2# New-York Reuter Apple activ look friend merger
    partner accord
    execut close company New-York-Times Thursday
#3# execut busy Apple Apple employ tell company talk
    Sun-Microsystems
    paper say
#4# Wednesday Saudi-Arabia Prince
    Alwaleed-Bin-Talal-Bin-Abdulaziz-Al-Saud
    own percent computer maker stock
    recent buy share market total 115 million
#5# Oracle-Corp Chairman Larry-Ellison confirm
    March 27 form independent
    investor gauge-interest take-over Apple
#6# company immediat avail comment

Following the preparatory phase described above, the
textual relevance and connectivity of each text unit is
assessed by measuring the number of stems which the text
unit shares with each of the other text units in the sample
text. The ranking process comprises two main stages: the
indexing of tokenized words, and the scoring of tokenized
words in text units.

In the first stage, all stems in the normalized text,
which has undergone the preparatory phase described
above, are indexed with reference to the text units in
which they occur. For example, Apple occurs five times
in four of the text units in the normalised text: once
in #1#, #2#, #5# and twice in #3#. Consequently, a record
is made where Apple is associated with these text unit
indexes:

<Apple {#1#, #2#, #3#, #3#, #5#}>

A similar record is made for each other stem in the
normalised text, each record being referred to as a
stem-index record.
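
The indexing stage can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the embodiment: the names NORMALISED and build_stem_index are hypothetical, and the input is the normalised sample text given above, with each text unit held as a whitespace-separated string of stems.

```python
from collections import defaultdict

# Normalised sample text: text unit index -> stems (stop words already removed).
NORMALISED = {
    1: "report Apple look partner",
    2: "New-York Reuter Apple activ look friend merger partner accord "
       "execut close company New-York-Times Thursday",
    3: "execut busy Apple Apple employ tell company talk "
       "Sun-Microsystems paper say",
    4: "Wednesday Saudi-Arabia Prince Alwaleed-Bin-Talal-Bin-Abdulaziz-Al-Saud "
       "own percent computer maker stock recent buy share market total 115 million",
    5: "Oracle-Corp Chairman Larry-Ellison confirm March 27 form independent "
       "investor gauge-interest take-over Apple",
    6: "company immediat avail comment",
}


def build_stem_index(units):
    """Associate each stem with the index of every text unit occurrence."""
    index = defaultdict(list)
    for unit_id in sorted(units):
        for stem in units[unit_id].split():
            # One entry per occurrence, so a stem used twice in a unit
            # contributes that unit's index twice.
            index[stem].append(unit_id)
    return dict(index)


stem_index = build_stem_index(NORMALISED)
print(stem_index["Apple"])
```

Each value in stem_index plays the role of a stem-index record; the record for Apple lists text unit 3 twice because the stem occurs twice in that unit.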

A final text unit score is calculated for each text unit
using the list of stem-index records resulting from the
indexing stage described above. The objective of such a
scoring process is to register how often the tokenized
words from a text unit occur in each of the other text
units. In performing this assessment, provisions are
made for a threshold which specifies the minimal number
of links required for text units to be considered as
lexically cohesive. The recursive scoring procedure
used to generate the final scores for each text unit makes
use of the following variables.

TRSH    is the lexical cohesion threshold
TU      is the current text unit
LC_TU   is the current lexical cohesion score of TU (i.e.
        LC_TU is the count of tokenized words TU shares
        with some other text unit)
CLevel  is the level of the current lexical cohesion
        score, calculated as the difference between LC_TU
        and TRSH
Score   is the lexical cohesion score previously
        assigned to TU (if any)
Level   is the level for the lexical cohesion score
        previously assigned to TU (if any)

The scoring procedure makes use of a scoring structure
<Level, TU, Score>, and is repeated for each text unit
in turn, in order to produce the final score for the text
unit TU (i.e. the final value of LC_TU in the scoring
structure). The procedure can then be repeated for other
text units TU. The recursive scoring procedure used in
this exemplary embodiment is as follows.

If LC_TU = 0, then do nothing
else, if the scoring structure <Level, TU, Score>
exists, then
    if Level > CLevel, then do nothing
    else, if Level = CLevel, then the new scoring
        structure is <Level, TU, Score + LC_TU>
    else, if CLevel > 0, then
        if Level > 0, then the new scoring structure is
            <1, TU, Score + LC_TU>
        if Level <= 0, then the new scoring structure
            is <1, TU, LC_TU>
    else, if CLevel <= 0, the new scoring structure is
        <CLevel, TU, LC_TU>
else (if the scoring structure does not exist, then)
    if CLevel > 0, then create the scoring structure
        <1, TU, LC_TU>
    else create the scoring structure <CLevel, TU, LC_TU>
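
The procedure above can be transcribed into Python roughly as follows. This is a sketch under stated assumptions, not the embodiment itself: the function names update and score_unit are hypothetical, the scoring structure is represented as a (level, score) tuple for a single text unit (None when it does not yet exist), and the individual scores are folded in with a plain loop rather than recursion.

```python
def update(structure, lc_tu, trsh):
    """Fold one individual score LC_TU into the scoring structure of a text unit."""
    if lc_tu == 0:
        return structure                      # no shared stems: do nothing
    clevel = lc_tu - trsh                     # CLevel = LC_TU - TRSH
    if structure is None:                     # scoring structure does not exist yet
        return (1, lc_tu) if clevel > 0 else (clevel, lc_tu)
    level, score = structure
    if level > clevel:
        return structure                      # a better level is already recorded
    if level == clevel:
        return (level, score + lc_tu)
    if clevel > 0:
        # Any score above the threshold belongs to level 1.
        return (1, score + lc_tu) if level > 0 else (1, lc_tu)
    return (clevel, lc_tu)                    # CLevel <= 0 and better than Level


def score_unit(individual_scores, trsh):
    """Return the final (level, score) for one text unit, or None if unscored."""
    structure = None
    for lc_tu in individual_scores:
        structure = update(structure, lc_tu, trsh)
    return structure


print(score_unit([3, 4, 0, 1, 1], trsh=2))
```

For the individual scores of text unit #2# (3, 4, 0, 1, 1) and a threshold of 2, only the scores 3 and 4 exceed the threshold, so the unit ends at level 1 with a final score of 7.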

The above procedure can be more readily understood by
referring to Figure 1, which shows the procedure in the
form of a flow chart. In the flow chart decisions are
indicated by diamond-shaped boxes. If the answer to the
question within the box is yes, the procedure follows
the arrow labelled Y at the bottom of the box, otherwise
the procedure follows the arrow labelled N at one of the
sides of the box.

The start of the procedure is indicated by step 10. In
step 12 the index of the first text unit of the
normalised text is taken and represented by #TU#. In
step 14 the index of the last text unit is taken and
represented by #B#. In the sample text given above, the
last text unit is text unit #6#. The procedure then
flows to step 16 where the lexical cohesion score of #TU#
and #B# is calculated and assigned to LC_TU. This lexical
cohesion score is the individual score referred to above
and shown in Table 1. However, the manner in which it is
calculated differs from that described above, and will
now be described.

Suppose for example, we are scoring text unit #2# (i.e.
#TU# = #2#) with a lexical cohesion threshold of 2. First,
all stem-index records whose stem is present in text unit
#2# are selected, as shown below.

<Apple {#1#, #2#, #3#, #3#, #5#}>
<company {#2#, #3#, #6#}>
<execut {#2#, #3#}>
<look {#1#, #2#}>
<partner {#1#, #2#}>

Stems which are associated with only one text unit index 
are eliminated from this list as they simply occur in a 
text unit, but do not connect a pair of text units. 

Then a tuplet is formed consisting of the index for the
text unit to be scored for lexical cohesion (i.e. #2#),
and all the stem-index records whose stem occurs in that
text unit, as shown below.

<#2# <Apple {#1#, #2#, #3#, #3#, #5#}>
     <company {#2#, #3#, #6#}>
     <execut {#2#, #3#}>
     <look {#1#, #2#}>
     <partner {#1#, #2#}> >



Next, identical index occurrences in the tuplet are summed
together, to give the following results.

        #1#   #2#   #3#   #4#   #5#   #6#
#2#     3           4     0     1     1

Table 2



Index occurrences referring to the text unit being
assessed (i.e. #2#) are not counted as they do not register
lexical cohesion (thus the second entry in the table is
blank).
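
The summing of index occurrences can be sketched in Python as shown below. The names STEM_INDEX and individual_scores are hypothetical. One point is an assumption made here so that the sketch reproduces the symmetric pair scores of the sample text: a record is counted once for each occurrence of its stem in the text unit being scored (for example, Apple counts twice when scoring text unit #3#).

```python
from collections import Counter

# Stem-index records for the stems shared by at least two text units.
STEM_INDEX = {
    "Apple": [1, 2, 3, 3, 5],
    "company": [2, 3, 6],
    "execut": [2, 3],
    "look": [1, 2],
    "partner": [1, 2],
}


def individual_scores(unit_id, stem_index):
    """Count how often every other text unit appears in the records of unit_id."""
    counts = Counter()
    for indexes in stem_index.values():
        own = indexes.count(unit_id)          # occurrences of the stem in unit_id
        if own == 0 or len(set(indexes)) < 2:
            continue                          # stem absent, or confined to one unit
        for other in indexes:
            if other != unit_id:              # the unit itself is not counted
                counts[other] += own          # assumption: weight by own occurrences
    return counts


print(dict(individual_scores(2, STEM_INDEX)))
```

For text unit #2# this yields 3 for #1#, 4 for #3#, 1 for #5# and 1 for #6#, in agreement with Table 2; text units that never appear in the records, such as #4#, simply have a count of zero.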

The same procedure of forming a tuplet and summing
identical index occurrences is then carried out for each
other text unit. For example, the tuplet for text unit
#6# is:

<#6# <company {#2#, #3#, #6#}> >

This is simpler than the tuplet for text unit #2# because
company is the only stem which text unit #6# shares with
any other text unit. This tuplet gives the following
results.

        #1#   #2#   #3#   #4#   #5#   #6#
#6#     0     1     1     0     0

Table 3



This method is considerably faster than that of the prior
art because it does not involve a comparison of every pair
of text units for each word in the sample text.

The final cohesion score of text units #2# and #6# is
calculated by applying the scoring procedure of Figure
1 to each row in Table 2 and Table 3 respectively. Scoring
a text unit according to this procedure involves adding
the individual scores which are either above a threshold
(for Level 1), or not above the threshold and of the same
magnitude (for lower levels). (The use of Levels in the
procedure is discussed below.)

Having discussed the way in which individual lexical
cohesion scores (for each text unit pair) are calculated
in step 16 using tuplets, we shall return to Figure 1 to
follow the procedure for calculation of the final lexical
cohesion score for each text unit. However, before
returning to Figure 1 it is noted that the simplest way
of forming the final score would be to sum the individual
scores for each text unit (i.e. for #2# and #6#, sum each
row in Tables 2 and 3 above), whilst ignoring all
individual scores below a certain threshold value.
However, the procedure of Figure 1 goes further in that
it determines not only a final score for each text unit,
but also a level for each text unit, as discussed below.

The highest level is 1, which indicates that the greatest
individual score (for a given text unit) is above the
threshold. The final score for that text unit is then
simply the sum of all individual scores (for that text
unit) which are above the threshold.

The meanings of level 1 and the next three levels below
level 1, and the ways in which the final score for these
levels is calculated, are shown in the table below.

Level   Meaning of Level                           Final Score

 1      Greatest individual score > threshold      Sum of all individual scores above
                                                   threshold
 0      Greatest individual score = threshold      Sum of all individual scores equal to
                                                   threshold
-1      Greatest individual score = threshold - 1  Sum of all individual scores equal to
                                                   threshold - 1
-2      Greatest individual score = threshold - 2  Sum of all individual scores equal to
                                                   threshold - 2



It will be seen that if threshold = 0, only level 1
exists, and the final score for a given text unit is
simply the sum of all individual scores for that text
unit. In fact the total number of levels is equal to the
threshold + 1.



Some examples of individual scores, and the levels and 
final scores they produce (by following the procedure of 
Figure 1) for a threshold of 2 are given below. 



Individual scores   Level   Final Score

2 0 2 0 1            0      4
1 1 0 0 0           -1      2
5 6 2 0 0            1      11
1 1 1 1 1           -1      5



The purpose of calculating a level for each text unit is
to allow the text units to be ranked first according to
level (highest level first) and second according to final
score (highest final score first). In this way, text
units having no individual scores above the threshold are
not necessarily ignored in the subsequent summarisation
process.
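
A two-key ranking of this kind might be sketched in Python as follows. The names RESULTS and rank are hypothetical; the (level, score) values are those the description reports for the sample text with a threshold of 2, with None standing for a text unit that has no level. Placing such units last and breaking remaining ties by text unit index are assumptions of this sketch.

```python
# (level, final score) per text unit for the sample text, threshold 2.
RESULTS = {1: (1, 3), 2: (1, 7), 3: (1, 4), 4: None, 5: (0, 2), 6: (-1, 2)}


def rank(results):
    """Order text units by level (highest first), then final score (highest first)."""
    def key(unit_id):
        entry = results[unit_id]
        if entry is None:
            return (1, 0, 0, unit_id)         # assumption: unscored units come last
        level, score = entry
        return (0, -level, -score, unit_id)   # negate to sort in descending order
    return sorted(results, key=key)


print(rank(RESULTS))
```

Text units #5# and #6# have the same final score but are separated by level, which is the case the level mechanism is intended to handle.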

Returning to Figure 1, in step 18 the procedure branches
into two depending on whether LC_TU = 0, where LC_TU is the
lexical cohesion score of the text unit currently being
considered. A lexical cohesion score of zero between two
text units (i.e. LC_TU = 0) indicates that the two text units
do not share any stems. If LC_TU = 0 then the procedure
goes to step 20. As discussed below, the text unit index
#B# is decremented by 1 at step 28 during each cycle of
the procedure. At step 20, if #B# has reached 1 then #TU#
is incremented by 1 in step 22. That is, the next text
unit (in this case #2#) is assigned to #TU#. In step 24
the procedure is stopped (at step 26) if #TU# has reached
the maximum value + 1 (i.e. 6 + 1 = 7 for our sample text),
otherwise control passes back to step 14.

At step 20, if #B# has not yet been decreased to the
first text unit (i.e. #B# has not yet reached 1) then
control passes to step 28, in which #B# is decremented by
1 (i.e. the next lower text unit is assigned to #B#).

It will therefore be seen that the effect of steps 10 to
28 is to calculate the individual lexical cohesion scores
for all pairs of text units.

Returning to step 18, if LC_TU does not equal 0, then
control passes to step 30, which determines whether or
not the scoring structure <Level, TU, Score> already
exists. The first time that step 30 is reached no
scoring structure will already exist, and control will
pass to step 32, which determines whether CLevel is
greater than 0. CLevel is the current value of Level and
is equal to (LC_TU - TRSH), where TRSH is the lexical
cohesion threshold, which is selected in advance. In
steps 34 and 36 values are assigned to the scoring
structure according to the outcome of step 32, and
control then passes back to step 20.

At step 30, if the scoring structure already exists
(which will always be the case except for the first time
step 30 is reached for each value of TU, given that the
first time step 30 is reached values are assigned to the
scoring structure at steps 34 and 36 as described above),
control passes to step 38, which determines whether Level
(i.e. the previous value of CLevel) is greater than
CLevel. If so, control passes back to step 20. Otherwise,
control passes to step 40, which determines whether
Level is equal to CLevel. If so, new values are assigned
to the scoring structure in step 42, and control passes
back to step 20. Otherwise, control passes to step 44
(see Figure 2), which determines whether CLevel is greater
than 0. If so, control passes to step 46, and new values
are assigned to the scoring structure in step 48 or step
50, depending on whether Level is greater than 0, and
control passes back to step 20. At step 44, if CLevel is
not greater than 0, control passes to step 52, which
determines whether CLevel is less than, or equal to, 0.
If step 52 is reached, the answer to this question should
always be yes, so that new values are assigned to the
scoring structure in step 54, and control is passed back
to step 20.

Following the procedure of Figure 1 for all text units
in the sample text, and a threshold of 2, the levels and
final scores assigned to each text unit are as follows:-

Text Unit   Level   Score
#1#         1       3
#2#         1       7
#3#         1       4
#4#                 0
#5#         0       2
#6#         -1      2

These provide the following ranking of text units in terms
of lexical cohesion.

Rank   Text Unit   Level   Score
1st    #2#         1       7
2nd    #3#         1       4
3rd    #1#         1       3
4th    #5#         0       2
5th    #6#         -1      2
6th    #4#                 0

This shows the preferred order in which the text units 
will be selected in a summarisation process. It is noted 
that no level is assigned to text unit #4#, as this text 
unit shares no stems with any other text unit. 
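
The ordering in the ranking table can be reproduced with a short sketch. It assumes that text units are ordered by level and then by score, both descending, and that a text unit with no level comes last; the names `ScoredUnit` and `rank_units` are illustrative and do not appear in the figures.

```python
from typing import NamedTuple, Optional


class ScoredUnit(NamedTuple):
    index: int             # text unit index, e.g. 2 for #2#
    level: Optional[int]   # None when no level was assigned
    score: int


def rank_units(units):
    """Order units by level, then score (both descending); unlevelled units last."""
    def key(unit):
        has_level = unit.level is not None
        return (has_level, unit.level if has_level else 0, unit.score)
    return sorted(units, key=key, reverse=True)


# Levels and scores taken from the table for the sample text.
SAMPLE = [
    ScoredUnit(1, 1, 3), ScoredUnit(2, 1, 7), ScoredUnit(3, 1, 4),
    ScoredUnit(4, None, 0), ScoredUnit(5, 0, 2), ScoredUnit(6, -1, 2),
]

ranked_indexes = [unit.index for unit in rank_units(SAMPLE)]
```

Applied to the sample values, `ranked_indexes` lists the text units in the same order as the ranking table.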

When used with a dictionary database providing
information about the subject domain of words, the method
described above can be slightly modified to detect the
major themes and topics of a document automatically.
As an example, the words in our sample text have the
following subject domain codes.



Word              Associated Codes
actively-adv      OR
business-n        BZ
buy-v             [illegible]
confirm-v
company-n         F, MI, SCO, TH
employee-n        LAB
executive-n       BZ, GOV
friendly-adj      FA
independent-adj   CHT, FA
interest-n        BZ, EC, G, J, U
investor-n        IV, ON
look-v            PHYA
maker-n           JC
market-n          BZ, MAR
merger-n          MERG
open-adj          CER, PFE
own-v             MEN
paper-n           PAPP
partner-n         DA, F, MGE, TG
say-v             CN
stock-n           AH, AM, AP, BRE, FLW, FOO, GU, IV, PM
take-v            EC, PG, SH, V, WRI
talk-n            RHE

The meanings of these codes are given below:-

CODE   Explanation
AH     Animal Farming & Husbandry
AM     Animal Names (not taxonomic terms (TAX))
AP     Anthropology & Ethnology (incl racial groups)
BRE    Breeds and Breeding
BZ     Business & Commerce
CER    Ceremonies
CHR    Christianity
CHT    Character Traits (eg. meddlesome, mellow, outgoing)
CN     Communications (eg. telephony, telegraphy,
       audiovisual, information science, radio)
DA     Dance & Choreography
EC     Economics & Finance
F      Finance & Business
FA     Overseas Politics & International Relations
FLW    Flower Names: plants known primarily as flowers
FOO    Foods: all edible items
G      Sports (incl Games & Pastimes)
GOV    Government Admin & Organisations (eg. reshuffles)
GROU   Groups of Musicians
GU     Guns
IV     Investment & Stock Markets
J      Crime and the Law
JC     Judaeo-Christian Religion
LAB    Staff and the Workforce (incl Labour relations)
MAR    Marketing & Merchandising
MEN    Mental States & Feelings
       (eg. depressed, tense, non-plussed)
MERG   Mergers, Monopolies, Takeovers, Joint Ventures
MGE    Marriage, Divorce, Relationships & Infidelity
MI     Military (the armed forces)
ON     Occupations & Trades
OR     Organisations, Groups & Orders
PAPP   Paper & Stationery
PFE    Banking & Personal Finance
PG     Photography
PHYA   Animal physiology
PM     Plant Names
POP    Pop & Rock
RHE    Rhetoric & Oratory (eg. ad lib, eulogy, scripted)
SCO    Scouting & Girl Guides
SH     Clothing
TG     Team Games
TH     Theatre
U      Politics, Diplomacy & Government
V      Transport (incl transport infrastructure)
WRI    Writing

A further embodiment relating to subject analysis
involves a method which is the same as that described
above, except that each word is first lemmatised (rather
than stemmed), and then replaced by all of the subject
domain codes associated with that word. The individual
scores for pairs of text units are then calculated on the
basis of shared codes rather than shared words, using
code-index records rather than stem-index records.

However, an extra (disambiguation) step is required in
order to avoid (or at least reduce the chances of)
counting codes which are out of context, that is codes
which relate to senses of the word other than the intended
sense. The disambiguation step involves dropping text
unit indexes from the code-index records of tuplets if
they relate to the same word as the first element (i.e.
text unit index) of the tuplet. This requires that the
word associated with each text unit index in each
code-index record be remembered (i.e. recorded) by the
procedure. This procedure can be demonstrated by the
following example.

In the sample text the code BZ (Business & Commerce) is
associated with the words:

executive   occurring once in text units #2# and #3#
business    occurring once in text unit #3#
market      occurring once in text unit #4#
interest    occurring once in text unit #5#

Consequently, a code-index record can be made where the
subject domain code BZ is associated with these text unit
indexes, that is:

<BZ {#2# #3# #3# #4# #5#}>

The full list of code-index records for the sample text
is shown below (instances where a code occurs in a single
text unit are removed as they do not represent lexical
cohesion links).

<BZ {#2# #3# #3# #4# #5#}>
<CN {#2# #3# #3# #4#}>
<DA {#1# #2#}>
<F {#1# #2# #2# #3# #6#}>
<FA {#2# #5#}>
<GOV {#2# #3#}>
<IV {#4# #5#}>
<MGE {#1# #2#}>
<MI {#2# #3# #6#}>
<SCO {#2# #3# #6#}>
<TG {#1# #2#}>
<TH {#2# #3# #6#}>
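
The construction of code-index records can be sketched as follows. This is a minimal illustration rather than the procedure of the figures: the function name `build_code_index` and the data layout are assumptions, and only the four words carrying the code BZ are included. Each entry keeps the word alongside the text unit index, since the disambiguation step needs it.

```python
from collections import defaultdict


def build_code_index(occurrences, codes_by_word):
    """Map each subject domain code to its (text_unit, word) entries.

    occurrences: iterable of (text_unit, word) pairs.
    codes_by_word: dict mapping a word to its subject domain codes.
    Codes that reach only one distinct text unit are dropped, because
    they cannot link two text units.
    """
    index = defaultdict(list)
    for unit, word in occurrences:
        for code in codes_by_word.get(word, ()):
            index[code].append((unit, word))
    return {
        code: sorted(entries)
        for code, entries in index.items()
        if len({unit for unit, _ in entries}) > 1
    }


# Codes and occurrences for the four BZ words of the sample text.
CODES = {
    "executive": ["BZ", "GOV"],
    "business": ["BZ"],
    "market": ["BZ", "MAR"],
    "interest": ["BZ", "EC", "G", "J", "U"],
}
OCCURRENCES = [
    (2, "executive"), (3, "executive"), (3, "business"),
    (4, "market"), (5, "interest"),
]

records = build_code_index(OCCURRENCES, CODES)
bz_units = [unit for unit, _ in records["BZ"]]
gov_units = [unit for unit, _ in records["GOV"]]
```

With this reduced input only BZ and GOV survive, and `bz_units` corresponds to the record <BZ {#2# #3# #3# #4# #5#}>.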

The first tuplet (disregarding the disambiguation step
mentioned above) is then:

<#1# <DA {#1# #2#}>
     <F {#1# #2# #2# #3# #6#}>
     <MGE {#1# #2#}>
     <TG {#1# #2#}>>

and so on for the other tuplets.

To simplify matters, in order to illustrate the
disambiguation step, rather than calculate the individual
scores for each pair of text units, we shall consider only
the contribution to the individual scores which is made
by one of the codes, for example code BZ. The BZ
components of all the tuplets are:

<#2# <BZ {[#2#] #3# #3# #4# #5#}>>
<#3# <BZ {#2# [#3#] [#3#] #4# #5#}>>
<#4# <BZ {#2# #3# #3# [#4#] #5#}>>
<#5# <BZ {#2# #3# #3# #4# [#5#]}>>

where indexes identical with the first index of each
tuplet are shown in square brackets (in place of
strikethrough) to indicate that they are excluded, as
above.

When allowance is made for the fact that each index is
associated with a particular word, the BZ components of
the tuplets become:

<#2(executive)# <BZ {[#3(executive)#] #3(business)# #4# #5#}>>
<#3(executive)# <BZ {[#2(executive)#] #4# #5#}>>
<#3(business)# <BZ {#2(executive)# #4# #5#}>>
<#4# <BZ {#2# #3# #3# #5#}>>
<#5# <BZ {#2# #3# #3# #4#}>>

where the disambiguation step is illustrated by showing
indexes relating to the same word as the first index of
each tuplet in square brackets (in place of strikethrough)
to indicate that they are excluded. The final tuplets are
then:

<#2(executive)# <BZ {#3(business)# #4# #5#}>>
<#3(executive)# <BZ {#4# #5#}>>   n.b. #2(executive)# excluded.
<#3(business)# <BZ {#2# #4# #5#}>>
<#4# <BZ {#2# #3# #3# #5#}>>
<#5# <BZ {#2# #3# #3# #4#}>>






The contributions made by BZ to the individual scores of
text unit pairs are then as follows:

       #1#   #2#   #3#   #4#   #5#   #6#
#1#    -     0     0     0     0     0
#2#    0     -     1     1     1     0
#3#    0     1     -     2     2     0
#4#    0     1     2     -     1     0
#5#    0     1     2     1     -     0
#6#    0     0     0     0     0     -


When the same procedure is followed for certain other
codes, such as DA, PA, GOV etc, no valid tuplets result.
This is because the text unit indexes within the
code-index records for these codes all relate to the same
word. For example, the code GOV arises from the word
"executive" which occurs in text units #2# and #3#, thus
creating the code-index record <GOV {#2# #3#}> mentioned
above. Because this code-index record does not form a
valid tuplet, the "Government" sense of the word
"executive" makes no contribution to the individual
scores mentioned above. We have already seen that the
Business sense of the word "executive" does make such
a contribution, which is the desired result because it
is the "Business" sense of the word which is intended in
the sample text. The method thus achieves a degree of
disambiguation of the subject domain codes, and rejects
codes which are out of context.

Only instances where the words related to the same code 
differ in spelling are taken into account. This makes it 
possible to achieve higher precision in individuating 
salient themes/topics and assessing their relative
importance. Taking the intersection of code sets for
words with different spelling occurring in the same
document tends to exclude contextually inappropriate
interpretations for the words.

However, in cases where a word in the sample text is 
associated with only one subject code, the disambiguation
step is not carried out because no disambiguation is
necessary. Hence the code CN, relating to the word "say",
remains.
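
The disambiguation step and the resulting per-code contributions can be sketched as below. The function name `code_contributions` and the `unambiguous_words` parameter are illustrative assumptions. The CN record used here assumes that "say" occurs once in each of text units #3# and #4#, as the pair list for CN suggests.

```python
from collections import Counter


def code_contributions(record, unambiguous_words=frozenset()):
    """Count the links that one code-index record contributes.

    record: list of (text_unit, word) entries for a single code.
    unambiguous_words: words with only one subject code, for which the
    same-word exclusion is skipped.
    Returns a Counter mapping (head_unit, other_unit) to a link count.
    """
    links = Counter()
    for head_unit, head_word in record:
        for unit, word in record:
            if unit == head_unit:
                continue  # same text unit as the head of the tuplet
            if word == head_word and word not in unambiguous_words:
                continue  # same word as the head: dropped by disambiguation
            links[(head_unit, unit)] += 1
    return links


BZ = [(2, "executive"), (3, "executive"), (3, "business"),
      (4, "market"), (5, "interest")]
GOV = [(2, "executive"), (3, "executive")]
CN = [(3, "say"), (4, "say")]  # assumed occurrences of "say"

bz_links = code_contributions(BZ)
gov_links = code_contributions(GOV)
cn_links = code_contributions(CN, unambiguous_words={"say"})
```

For BZ the counts agree with the contribution table (for instance two links from #3# to #4#, 16 links in all), GOV contributes nothing, and CN keeps its two links because "say" is exempt from the same-word test.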

The following table shows the text unit pairs which each
code connects.

CODES   TEXT UNIT PAIRS
BZ      2-3 2-4 2-5 3-4 3-5 3-2 3-4 3-5 4-2 4-3 4-3 4-5 5-2 5-3 5-3 5-4
F       1-2 1-3 1-6 2-1 2-3 2-6 3-1 3-2 6-1 6-2
FA      2-5 5-2
IV      4-5 5-4
CN      3-4 4-3

Only five codes form valid tuplets, all the other codes 
being excluded (as described above).

In total, we have 16 text unit pairs for BZ, 10 for F,
and 2 each for FA, IV and CN. These data can be used to rank
text units in the sample text in terms of topic aptness 
by adaptation of the procedure of Figure 1. 

The total of all individual scores for each subject domain
code (eg. 16 for BZ, etc) can be converted into percentage
ratios to provide a topic/theme profile of the text as
shown in the table below:-

50%      BZ   Business & Commerce
31.25%   F    Finance & Business
6.25%    IV   Investment & Stock Markets
6.25%    FA   Overseas Politics & International Relations
6.25%    CN   Communications

For example, the percentage for BZ is calculated as
16/(16+10+2+2+2) = 50%.
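
The same arithmetic, written as a small sketch (the function name `topic_profile` is illustrative only):

```python
def topic_profile(pair_counts):
    """Convert per-code pair totals into percentages of the overall total."""
    total = sum(pair_counts.values())
    if total == 0:
        return {}
    return {code: 100.0 * count / total for code, count in pair_counts.items()}


# Pair totals for the sample text, as given above.
profile = topic_profile({"BZ": 16, "F": 10, "IV": 2, "FA": 2, "CN": 2})
```

Here `profile["BZ"]` evaluates to 50.0 and `profile["F"]` to 31.25, in line with the table.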

When used in a summarization system, the level-based
differentiation of text units obtained through the
ranking procedure of Figure 1 (whether based on words or
on codes) can be made to provide an automatic indication
of abridgement size, for example by automatic selection
of all level 1 text units.

Summary size can also be specified by the user, e.g. as
a percentage of the original text size, the selected text
units being chosen from among the ranked text units with
higher levels and higher scores.
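
Both ways of fixing the abridgement size can be sketched as follows. The sketch makes two assumptions that are not stated in the text: text size is measured as a number of text units, and a fractional count is rounded up. The function names are illustrative.

```python
import math

# (text unit, level, score) in rank order for the sample text.
RANKED = [(2, 1, 7), (3, 1, 4), (1, 1, 3), (5, 0, 2), (6, -1, 2), (4, None, 0)]


def select_by_level(ranked, min_level=1):
    """Automatic size: keep every text unit at or above min_level."""
    return [unit for unit, level, _ in ranked
            if level is not None and level >= min_level]


def select_by_fraction(ranked, fraction):
    """User-specified size: keep the best-ranked share of the text units."""
    count = math.ceil(fraction * len(ranked))
    return [unit for unit, _, _ in ranked[:count]]


level_one_units = select_by_level(RANKED)
half_of_text = select_by_fraction(RANKED, 0.5)
```

For the sample text both calls return text units #2#, #3# and #1#, since exactly half of the text units are at level 1.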

The methods described can also be used as indexing devices
in various information systems such as information
retrieval and information extraction systems. For
example, in a database comprising a large number of texts
it is often desirable to provide a short abstract of each 
text to assist in both manual and computer searching of 
the database. The methods described above can be used to 
generate such short abstracts automatically. 

The ranking method described above can also be applied
taking into account additional ways of assessing lexical
cohesion, which could be used at step 18 of Figure 1,
such as:

the presence of synonyms across text units as established
by consulting an electronic dictionary of synonyms;

the presence of words sharing the same semantic indicators
across text units as established by consulting an
electronic dictionary, as in the example with subject
domain codes discussed above;

the presence of near-synonymous words across text units
established by estimating the degree of semantic
similarity between word pairs, as in the method
disclosed in British Patent Application No. 9717508.7;

the presence of anaphoric links across text units, i.e.
links between a referential expression such as a
pronoun or a definite description (e.g. The company
in text unit #6#) and its antecedent (Apple in text
unit #5#).

The same ranking method described in the preferred
embodiment can also be applied by using formatting 
commands as indicators of the relevance of particular 
types of text fragments. For example, text fragments 
enclosed in formatting commands encoding titles and 
section headings such as 

<h2>Report: Apple Looking for a Partner</h2> 

typically contain words which can be effectively used to 
provide an indication of the main topic in a text. These 
words can be given extra weight in the above method, and 
thus be used to assign additional textual relevance to 
text units which contain them, e.g. by increasing
further the lexical cohesion score of such text units
during the ranking procedure described above. Formatting
commands can also be selectively preserved so as to
maintain as much of the page layout of the original text
as possible. 
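
One way of giving extra weight to heading words is sketched below. The size of the bonus, the small stop-word set, the example sentence and the function names are all assumptions made for illustration; the text does not specify how much the score is increased.

```python
import re

HEADING = re.compile(r"<h[1-6][^>]*>(.*?)</h[1-6]>", re.IGNORECASE | re.DOTALL)
WORD = re.compile(r"[A-Za-z]+")


def heading_words(markup, stop_words=frozenset()):
    """Collect lower-cased non-stop words found inside <h1>..<h6> elements."""
    words = set()
    for fragment in HEADING.findall(markup):
        for word in WORD.findall(fragment):
            if word.lower() not in stop_words:
                words.add(word.lower())
    return words


def boosted_score(score, unit_text, title_words, bonus=1):
    """Add a bonus for each distinct heading word the text unit contains."""
    unit_words = {word.lower() for word in WORD.findall(unit_text)}
    return score + bonus * len(unit_words & title_words)


title_words = heading_words(
    "<h2>Report: Apple Looking for a Partner</h2>", stop_words={"for", "a"}
)
# Hypothetical text unit, used only to exercise the function.
new_score = boosted_score(2, "Apple is looking for a partner.", title_words)
```

The hypothetical text unit shares three heading words (apple, looking, partner), so its score rises from 2 to 5 with the assumed bonus of 1 per word.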

The ranking method described above can also be applied 
by using lemmatizing instead of stemming as a word
tokenization technique, or dispensing with word
tokenization altogether.

The same ranking method can also be applied to texts
written in a language other than English, by providing

a list of stop words for the language;

a stemmer or lemmatizer for the language; and

any additional means for assessing lexical cohesion
in the language such as semantic similarity and
anaphoric links.

Figure 3 shows schematically a system suitable for 
carrying out the methods described above. The system 
comprises a programmable data processor 70 with a program 
memory 71, for instance in the form of a read only memory 
(ROM), storing a program for controlling the data processor
70 to perform, for example, the method illustrated in
Figures 1 and 2. The system further comprises
non-volatile read/write memory 72 for storing, for example,
the list of stop words and the subject domain codes
mentioned above. Working or scratch pad memory for
the data processor is provided by random access memory
(RAM) 73. An input interface 74 is provided, for instance
for receiving commands and data. An output interface 75
is provided, for instance, for displaying information
relating to the progress and result of the procedure.

A text sample may be supplied via the input interface 74
or may optionally be provided in a machine-readable store
76. A thesaurus and/or a dictionary may be supplied in
the read only memory 71 or may be supplied via the input
interface 74. Alternatively, an electronic or
machine-readable thesaurus 77 and an electronic or
machine-readable dictionary 76 may be provided.

The program for operating the system and for performing 
the method described hereinabove is stored in the program
memory 71. The program memory may be embodied as
semiconductor memory, for instance of ROM type as
described above. However, the program may be stored in 
any other suitable storage medium, such as floppy disc 
71a or CD-ROM 71b. 

INDUSTRIAL APPLICABILITY 

The use of the structures according to the present 
invention considerably reduces the time taken to operate
on the text because it is no longer necessary to count 
the number of strings shared between all possible pairs 
of text units in turn. 

More specifically, the degree of connectivity of a text
unit with all other text units in a text can be simply
assessed by quantifying the elements (e.g. words) which
each text unit shares with pairs built by associating each
element in the text with the list of pointers to the text
units in which the element occurs. This provides a
significant advantage in terms of processing speed when
compared to a method such as the one described by Hoey
(1991) and Collier (1994) where the same assessment is
carried out by computing all pairwise combinations of text
units. In particular, the word-per-second processing
rate is significantly less affected by text size.
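
The contrast between the two approaches can be illustrated with a simplified sketch. It treats each text unit as a set of distinct stems, so repeated occurrences are not counted as they are in the records above, and the toy stems and function names are hypothetical. The baseline intersects every pair of text units, while the index-based version builds one element-to-text-unit index and only visits text units that actually share an element.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_scores(units):
    """Baseline: intersect the stem sets of every pair of text units."""
    scores = {}
    for (a, stems_a), (b, stems_b) in combinations(sorted(units.items()), 2):
        shared = len(stems_a & stems_b)
        if shared:
            scores[(a, b)] = shared
    return scores


def indexed_scores(units):
    """Index-based: build stem -> text units once, then count the links."""
    index = defaultdict(set)
    for unit, stems in units.items():
        for stem in stems:
            index[stem].add(unit)
    scores = defaultdict(int)
    for linked_units in index.values():
        for a, b in combinations(sorted(linked_units), 2):
            scores[(a, b)] += 1
    return dict(scores)


# Hypothetical stems for four text units.
UNITS = {
    1: {"appl", "partner"},
    2: {"appl", "partner", "talk"},
    3: {"talk", "stock"},
    4: {"paper"},
}

baseline = pairwise_scores(UNITS)
indexed = indexed_scores(UNITS)
```

Both functions return the same scores for this input; the index-based version never examines text unit 4, which shares no stem with any other text unit.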



